Introduction

Allrecipes.com

If you’ve ever searched online for a recipe or asked “what should I make for dinner” in a search engine, one of the results is most likely from allrecipes.com.

Allrecipes.com is a recipe-sharing platform with over 100,000 recipes and 60 million users globally. Users can submit their own recipes and interact with other users by commenting on, reviewing, or rating recipes. Allrecipes is unique in that it is a public forum for recipes and a community for anyone who wants to cook, rather than a carefully curated blog. You can find family heirlooms like Jewish Grandma’s Best Beef Brisket or simple recipes like these basic crepes (which I have tried and can attest to!).

In addition to being a great resource when you’re in a cooking rut, it has a trove of data on each page. So much so that Brian Mubia took notice and decided to create the tastyR package.

tastyR package

The tastyR package contains two datasets, allrecipes and cuisines. For this project, we’ll dive into cuisines - a dataset containing over 2,000 recipes from allrecipes with information on ingredients, cuisine, nutrition, reviews, and ratings.

Analysis Plan

After reviewing the cuisines dataset, I was most interested in two components: the cuisine variable, which closely parallels country of origin, and the ingredients variable. We often group countries’ cuisines into broader categories by geographic location: for example, the foods of Italy, Greece, and Turkey are commonly considered Mediterranean. I was curious to see whether that intuition holds up based on ingredient usage, or whether geographically distant cuisines have more in common than we might assume.

To help guide the project, I came up with some questions to try and answer.

  • What are the most common ingredients overall and by cuisine?
  • Can we identify cuisines that have similar ingredient profiles?
  • Are cuisines with similar ingredient profiles geographically close?

To answer these questions, we will:

  1. Produce histograms to display common ingredients.
  2. Use PCA to reduce ingredients to a few components.
  3. Use the UN Geoscheme to define geographic regions and visually review if countries that are close together in the PCA space are also within the same geographic regions.

Preparation

Before diving into the data analysis, I want to go over the contents of the dataset, data cleaning steps and how ingredients were tokenized.

Dataset Contents

Table 1. Summary of Cuisine Dataset
variable        type       label
country         character  Cuisine
name            character  Name of Recipe
url             character  URL
author          character  Author
date_published  Date       Date Published or Last Updated
ingredients     character  List of Ingredients
calories        integer    Calories per Serving
fat             integer    Fat per Serving
carbs           integer    Carbs per Serving
protein         integer    Protein per Serving
avg_rating      numeric    Average Ratings
total_ratings   integer    Total Number of Ratings
reviews         integer    Total Number of Reviews
prep_time       integer    Prep Time (in minutes)
cook_time       integer    Cook Time (in minutes)
total_time      integer    Total Time (in minutes)
servings        integer    Number of Servings

The cuisines dataset contains 2,218 records with 17 different variables, described above. Records are uniquely identified by name and author.

Ingredients are stored as a single comma-delimited string that includes measurements and units, and the format is not standardized. Fat, carbs, and protein are measured in grams. Ratings are on a scale of 1 to 5 stars.

One thing to note is that total ratings and reviews are erroneously truncated to the thousands unless a recipe had fewer than 1,000 ratings in total. This was discovered when creating frequency tables and confirmed online (GitHub - TidyTuesday Data).

In the exploratory data analysis, we will cover some basic descriptive statistics on some of these variables to give us a general idea of the recipes we are working with.

Data Cleaning

For data cleaning, the following steps were taken:

  • Removed records with missing values in critical fields (name, author, cuisine, and ingredients)
  • Identified and removed duplicate recipes by name:
    • Exact duplicates by name were identified
    • Similar duplicates by name were identified using the functions stringdistmatrix, hclust, and cutree. The threshold for cutree was 0.05, and the flagged duplicates were manually reviewed to confirm that they were similar enough to be considered the same recipe and that the threshold was appropriate.
    • Within each set of duplicates, the recipe with the highest number of ratings was kept

Outliers in numeric variables were not examined, as these variables will not be used in the main analysis. Data cleaning removed 9 records from the initial dataset, leaving 2,209.
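The near-duplicate step can be sketched as follows. This is a minimal illustration rather than the exact code used: the recipe names and rating counts are made up, and the Jaro-Winkler distance and complete linkage are assumptions, since the write-up only names the three functions and the 0.05 threshold. It requires the stringdist package.

```r
# Requires the stringdist package: install.packages("stringdist")
library(stringdist)

# Hypothetical recipe names and rating counts, for illustration only.
recipes <- data.frame(
  name = c("Basic Crepes", "Basic Crepe", "Beef Brisket", "Chicken Adobo"),
  total_ratings = c(120, 45, 300, 80),
  stringsAsFactors = FALSE
)

# Pairwise string distances between names, returned as a dist object.
# method = "jw" (distances between 0 and 1) is an assumption.
d <- stringdistmatrix(tolower(recipes$name), method = "jw")

# Cluster the names, then cut the tree at height 0.05 so that only
# very similar names share a cluster. The linkage is also an assumption.
hc <- hclust(d, method = "complete")
recipes$cluster <- cutree(hc, h = 0.05)

# Within each cluster, keep the recipe with the most ratings.
ordered <- recipes[order(recipes$cluster, -recipes$total_ratings), ]
deduped <- ordered[!duplicated(ordered$cluster), ]
print(deduped)
```

In practice the clusters containing more than one name would be reviewed by hand before anything is dropped, as described above.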

Ingredient Standardization

Below is an example of what the raw ingredients variable looks like.

##                                                                                                                                                                                                                                                                                                                 ingredients
## 1                                                                          1 pound sliced bacon, diced, 1 medium sweet onion, chopped, 9 large eggs, lightly beaten, 4 cups frozen shredded hash brown potatoes, thawed, 2 cups shredded Cheddar cheese, 1 ½ cups small curd cottage cheese, 1 ¼ cups shredded Swiss cheese
## 2                                                                                                                                                                                                   3  egg yolks, 1 tablespoon lemon juice, ¼ teaspoon Dijon mustard, 1 dash hot pepper sauce (e.g. Tabasco™), ½ cup butter
## 3 oil for deep frying, 1 cup unbleached all-purpose flour, 2 teaspoons salt, ½ teaspoon ground black pepper, ½ teaspoon cayenne pepper, ½ teaspoon paprika, ¼ teaspoon garlic powder, 1 large egg, 1 cup milk, 3  skinless, boneless chicken breasts, cut into 1/2-inch strips, ¼ cup hot pepper sauce, 1 tablespoon butter
## 4                                                                                                                                                                                       1 orange, 1 lemon, 1 lime, 1 (750 milliliter) bottle dry red wine, 1 ½ cups rum, 1 cup orange juice, ½ cup white sugar, or to taste
## 5                                                                                                         4 skinless, boneless chicken breast halves - pounded to ½-inch thickness, salt and pepper to taste, 2 tablespoons all-purpose flour, 1  egg, beaten, 1 cup panko bread crumbs, 1 cup oil for frying, or as needed

As you can see, it contains a lot of information, represented in different forms. For example, there is additional text within parentheses, as well as measurements and methods of preparation (e.g., “chopped”).

For the purposes of PCA, we will standardize and tokenize the ingredients so that we get one row per ingredient per recipe, with no measurements, units, or additional information. Unnecessary adjectives such as “small” and “large” will be removed.

The following steps were taken:

  • Standardized text by trimming whitespace, removing percentages, parentheses, and non‑ASCII characters
  • Split multi‑ingredient strings into separate rows
  • Identified and removed raw units and numeric measurements using regex patterns
  • Removed prepositions and conjunctions (e.g., “with”, “or”, “for”, “from”, “in”) and adjectives, using NLP annotation with UDPipe to add part‑of‑speech tags
  • Standardized plural forms of the food to the singular version
  • Identified and corrected misspellings using the functions stringdistmatrix, hclust, and cutree, as above. The threshold for cutree was 0.10, and clusters were manually reviewed to ensure that they were true misspellings.

Standardizing the ingredients was not a trivial effort. Given the number of recipes and ingredients, it is not guaranteed that every case was accounted for in this step.
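To give a flavor of the regex-based part of this cleanup, here is a minimal sketch in base R. The unit list and the patterns are hypothetical stand-ins, not the ones actually used, and the sketch leaves out the UDPipe part-of-speech step, so preparation words such as “sliced” survive.

```r
# Hypothetical, deliberately short unit list; a real one would be longer.
units <- c("pound", "cup", "tablespoon", "teaspoon", "dash", "ounce")
unit_pattern <- paste0("\\b(", paste(units, collapse = "|"), ")s?\\b")

clean_ingredient <- function(x) {
  x <- gsub("[^\\x20-\\x7E]", " ", x, perl = TRUE)       # non-ASCII (fractions, trademark signs)
  x <- tolower(x)
  x <- gsub("\\([^)]*\\)", " ", x, perl = TRUE)          # text inside parentheses
  x <- gsub("[0-9]+([./][0-9]+)?", " ", x, perl = TRUE)  # numeric quantities
  x <- gsub(unit_pattern, " ", x, perl = TRUE)           # measurement units
  x <- gsub("[[:punct:]]", " ", x, perl = TRUE)          # leftover punctuation
  trimws(gsub("\\s+", " ", x, perl = TRUE))              # collapse whitespace
}

# Example phrases taken from the raw output above
# (non-ASCII characters written as escapes).
raw <- c("1 pound sliced bacon",
         "\u00bd cup butter",
         "1 dash hot pepper sauce (e.g. Tabasco\u2122)")
cleaned <- clean_ingredient(raw)
print(cleaned)
```

Removing the remaining adjectives and participles is where the part-of-speech tags come in; a pattern-only approach like this one cannot reliably tell “sliced” from an ingredient name.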

After these steps, this is what the ingredients column became:

##              food
## 1           bacon
## 2           onion
## 3             egg
## 4          potato
## 5  cheddar cheese
## 6  cottage cheese
## 7    swiss cheese
## 8            yolk
## 9     lemon juice
## 10  dijon mustard
## 11   pepper sauce
## 12         butter
## 13          flour
## 14           salt
## 15         pepper
## 16 cayenne pepper
## 17        paprika
## 18  garlic powder
## 19            egg
## 20           milk

In the data analysis section, we will discuss the top ingredients.

Data Analysis

Descriptive Analysis

To get a sense of the data we are working with, I produced some basic graphs and a table with descriptive statistics on most of the variables.

We can see above that we are dealing with recipes mostly from the last 5 years, so they should reflect current food trends.

The above graphs display the spread of the nutritional variables. Half of the recipes have 11 grams of protein or less per serving, which suggests that the dataset may lean toward vegetarian rather than meat-based recipes. The median is about 320 calories per serving, so most recipes are likely moderate rather than indulgent.

The table below includes a more numerical look at the raw variables.

Table 2. Descriptive Statistics for Variables
variable           level            statistics
Number of Records  NA               2209
date_published     [2005,2010)      1 (0.05%)
date_published     [2010,2015)      10 (0.45%)
date_published     [2015,2020)      48 (2.17%)
date_published     [2020,2025]      2150 (97.33%)
calories           mean(sd)         358.16 (239.27)
calories           median(q1,q3)    319.5 (190, 477)
fat                mean(sd)         18.76 (16.96)
fat                median(q1,q3)    15 (7, 26)
carbs              mean(sd)         31.87 (25.87)
carbs              median(q1,q3)    26 (13, 45)
protein            mean(sd)         16.61 (16.3)
protein            median(q1,q3)    11 (4, 25)
avg_rating         mean(sd)         4.51 (0.4)
avg_rating         median(q1,q3)    4.6 (4.3, 4.8)
reviews            mean(sd)         77.06 (142.25)
reviews            median(q1,q3)    21 (6, 74)
prep_time          mean(sd)         21.53 (60.84)
prep_time          median(q1,q3)    15 (10, 25)
cook_time          mean(sd)         41.8 (63.23)
cook_time          median(q1,q3)    25 (10, 45)
total_time         mean(sd)         171 (642.8)
total_time         median(q1,q3)    60 (35, 120)
servings           mean(sd)         10.47 (13.44)
servings           median(q1,q3)    8 (4, 12)

Next, we will look at the two critical variables for this analysis: cuisine and ingredients. Below are graphs showing the proportion of recipes by cuisine and the most common ingredients.

There are over 40 different cuisines, most of which map directly to a single country, with a few exceptions such as Jewish, Cajun and Creole, Amish and Mennonite, and Southern Recipes. The largest shares of recipes are Brazilian and Filipino, while the smallest is Belgian. Several cuisines are missing from this dataset, which is a limitation of this analysis. For example, we do not have any data on African countries other than South Africa.

The top 25 ingredients are not very surprising and make intuitive sense. I am a little surprised that onion ranks ahead of salt and water. For the rest of the graphs and the PCA, I will remove the top 10 ingredients, as I don’t think they are very informative in defining a country’s ingredient profile.
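As a rough sketch of that filtering step, assuming the tokenized data has one row per ingredient per recipe: the column names and toy values below are hypothetical, and the toy example drops the top 2 ingredients where the analysis drops the top 10.

```r
# Hypothetical tokenized data: one row per ingredient per recipe.
tokens <- data.frame(
  recipe = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  food = c("onion", "salt", "ginger",
           "onion", "salt",
           "onion", "salt", "tofu", "ginger"),
  stringsAsFactors = FALSE
)

# Count how many recipes each ingredient appears in
# (unique() guards against an ingredient listed twice in one recipe).
counts <- sort(table(unique(tokens)$food), decreasing = TRUE)

# Drop the most common ingredients; 2 here, 10 in the analysis.
n_drop <- 2
common <- names(counts)[seq_len(n_drop)]
informative <- tokens[!tokens$food %in% common, ]
print(informative)
```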

Top 5 Ingredients by Cuisine

To answer which ingredients are most common by cuisine, I created the following histograms. They allow us to step through all the cuisines and see which ingredients appear most often. I organized them by UN Geoscheme region; when a cuisine was not a country, I grouped it with the region it is typically associated with. The one exception is Jewish cuisine, which spans multiple regions and is dispersed globally. We will use this grouping to color the individual cuisines when we graph their PC1 and PC2 scores.

To look at the top 50 ingredients across all the cuisines at once, I created this heatmap. It lets us identify which cuisines use each of the top 50 ingredients in similar proportions. The y-axis is grouped by region and the x-axis is ordered from the most common to the least common ingredient.

PCA

  • Steps
    • Filtered out uninformative ingredients (e.g., sugar, onion, butter, salt).
    • Selected the top 200 most frequent ingredients (in at least 5% of recipes) for analysis.
    • Built a country × ingredient frequency table.
    • Normalized counts by row totals to create proportional ingredient profiles per country.
    • Applied prcomp() to the ingredient proportion matrix with centering and scaling.
    • Generated a scree plot to visualize variance explained by each component.
    • Created scatterplots of PC1 vs PC2, colored by region.
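The steps above can be sketched in base R roughly as follows. The toy data and column names are hypothetical, and the constant-column guard is an addition for safety rather than something described in the steps.

```r
# Hypothetical tokenized data: one row per ingredient per recipe.
tokens <- data.frame(
  country = rep(c("Thai", "Vietnamese", "Italian", "Greek"), each = 4),
  food = c("ginger", "cilantro", "rice noodle", "peanut",
           "ginger", "cilantro", "rice noodle", "basil",
           "tomato", "basil", "olive oil", "parmesan",
           "tomato", "olive oil", "feta", "oregano"),
  stringsAsFactors = FALSE
)

# Country x ingredient frequency table.
freq <- table(tokens$country, tokens$food)

# Normalize counts by row totals so each country's profile sums to 1.
prop_mat <- unclass(prop.table(freq, margin = 1))

# Guard (not from the write-up): drop constant columns, which
# prcomp() cannot rescale to unit variance.
keep <- apply(prop_mat, 2, sd) > 0
pca <- prcomp(prop_mat[, keep, drop = FALSE], center = TRUE, scale. = TRUE)

# Share of variance per component, i.e. what a scree plot displays.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

# PC1 and PC2 scores per country, ready to plot and color by region.
scores <- data.frame(country = rownames(pca$x), pca$x[, 1:2])
print(scores)
```

Because the columns are scaled to unit variance, rare ingredients carry as much weight as common ones, which is one reason the ingredient list is restricted to the more frequent items first.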

[Interactive plot: results/plots/tsne_recipes.html]

Ingredients with the highest loadings on PC1:
##                  ingredient       PC1
## ginger root     ginger root 0.1531790
## cilantro           cilantro 0.1290366
## sprout               sprout 0.1264313
## chicken thigh chicken thigh 0.1233345
## star anise       star anise 0.1231000
## ginger               ginger 0.1183448
## rice noodle     rice noodle 0.1140547
## cilantro leaf cilantro leaf 0.1120065
## coriander         coriander 0.1070627
## peanut               peanut 0.1064545

Ingredients with the highest loadings on PC2:
##                          ingredient       PC2
## rice vinegar           rice vinegar 0.1764492
## ketchup                     ketchup 0.1684169
## ginger                       ginger 0.1598751
## rice wine vinegar rice wine vinegar 0.1410089
## sprout                       sprout 0.1402169
## chile paste             chile paste 0.1383847
## ginger root             ginger root 0.1297077
## tofu                           tofu 0.1291296
## chicken thigh         chicken thigh 0.1254302
## honey                         honey 0.1225864
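Tables like these can be pulled from the rotation matrix of a prcomp() fit. The helper below is a hypothetical sketch (the name top_loadings is not from the analysis), and it uses R's built-in USArrests data as a stand-in so that it runs on its own.

```r
# Hypothetical helper: the n variables with the largest loadings on one component.
top_loadings <- function(pca, pc = "PC1", n = 10) {
  loadings <- pca$rotation[, pc]
  top <- head(sort(loadings, decreasing = TRUE), n)
  data.frame(ingredient = names(top), loading = unname(top),
             stringsAsFactors = FALSE)
}

# Stand-in data only: with the recipe data, the rows would be ingredients.
pca_demo <- prcomp(USArrests, center = TRUE, scale. = TRUE)
res <- top_loadings(pca_demo, "PC1", n = 2)
print(res)
```

Keep in mind that the sign of a principal component is arbitrary, so the ingredients with the most negative loadings are just as informative as those with the most positive ones.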

Discussion

Limitations

Next Steps